Search CORE

45 research outputs found

Performance scalability analysis of JavaScript applications with web workers

Author: Pajuelo González Manuel Alejandro
Verdú Mulà Javier
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016
Field of study

Web applications are getting closer to the performance of native applications taking advantage of new standard–based technologies. The recent HTML5 standard includes, among others, the Web Workers API that allows executing JavaScript applications on multiple threads, or workers. However, the internals of the browser’s JavaScript virtual machine does not expose direct relation between workers and running threads in the browser and the utilization of logical cores in the processor. As a result, developers do not know how performance actually scales on different environments and therefore what is the optimal number of workers on parallel JavaScript codes. This paper presents the first performance scalability analysis of parallel web apps with multiple workers. We focus on two case studies representative of different worker execution models. Our analyses show performance scaling on different parallel processor microarchitectures and on three major web browsers in the market. Besides, we study the impact of co–running applications on the web app performance. The results provide insights for future approaches to automatically find out the optimal number of workers that provide the best tradeoff between performance and resource usage to preserve system responsiveness and user experience, especially on environments with unexpected changes on system workload.Peer ReviewedPostprint (author's final draft

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Speculative Vectorization for Superscalar Processors

Author: Pajuelo González Manuel A. (Manuel Alejandro)
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2005
Field of study

Traditional vector architectures have been shown to be very effective in executing regular codes in which the compiler can detect data-level parallelism, i.e. repeating the same computation over different elements in the same code-level data structure.A skilled programmer can easily create efficient vector code from regular applications. Unfortunately, this vectorization can be difficult if applications are not regular or if the programmer does not have an exact knowledge of the underlying architecture. The compiler has a partial knowledge of the program (i.e. it has a limited knowledge of the values of the variables). Because of this, it generates code that is safe for any possible scenario according to its knowledge, and thus, it may lose significant opportunities to exploit SIMD parallelism. In addition to this, we have the problem of legacy codes that have been compiled for former versions of the ISA with no SIMD extensions, which are therefore not able to exploit new SIMD extensions incorporated into newer ISA versions.In this dissertation, we will describe a mechanism that is able to detect and exploit DLP at runtime by speculatively creating vector instructions for prefetching and precomputing data for future instances of their scalar counterparts. This process will be called Speculative Dynamic Vectorization.A more in-depth study of this technique reveals a very positive characteristic: the mechanism can easily be tailored to alleviate the main drawbacks of current superscalar processors, particularly branch mispredictions and the memory gap. In this dissertation, we will describe how to rearrange the basic Speculative Dynamic Vectorization mechanism to alleviate the branch misprediction penalty based on reusing control-flow independent instructions. The memory gap problem will be addressed with a set of mechanisms that exploit the stall cycles due to L2 misses in order to virtually enlarge the instruction window.Finally, more refinements of the basic Speculative Dynamic Vectorization mechanism will be presented to improve its performance at a reasonable cost.Los procesadores vectoriales han demostrado ser muy eficientes ejecutando códigos regulares en los que el compilador ha detectado Paralelismo a Nivel de Datos. Este paralelismo consiste en repetir los mismos cálculos en diferentes elementos de la misma estructura de alto nivel.Un programador avanzado puede crear código vectorial eficiente para aplicaciones regulares. Por desgracia, esta vectorización puede llegar a ser compleja en aplicaciones regulares o si el programador no tiene suficiente conocimiento de la arquitectura sobre la que se va a ejecutar la aplicación.El compilador tiene un conocimiento parcial del programa. Debido a esto, genera código que se puede ejecutar sin problemas en cualquier escenario según su conocimiento y, por tanto, puede perder oportunidades de explotar el paralelismo SIMD (Single Instruction Multiple Data). Además, existe el problema de los códigos de legado que han sido compilados con versiones anteriores del juego de instrucciones que no disponían de instrucciones SIMD lo cual hace que no se pueda explotar las extensiones SIMD de las nuevas versiones de los juegos de instrucciones.En esta tesis se presentará un mecanismo capaz de detectar y explotar, en tiempo de ejecución, el paralelismo a nivel de datos mediante la creación especulativa de instrucciones vectoriales que prebusquen y precomputen valores para futuras instancias de instrucciones escalares. Este proceso se llamará Vectorización Dinámica Especulativa.Un estudio más profundo de esta técnica conducirá a una conclusión muy importante: este mecanismo puede ser fácilmente modificado para aliviar algunos de los problemas más importantes de los procesadores superescalares. Estos problemas son: los fallos de predicción de saltos y el gap entre procesador y memoria. En esta tesis describiremos como modificar el mecanismo básico de Vectorización Dinámica Especulativa para reducir la penalización de rendimiento producida por los fallos de predicción de saltos mediante el reuso de datos de instrucciones independientes de control. Además, se presentará un conjunto de técnicas que explotan los ciclos de bloqueo del procesador debidos a un fallo en la cache de segundo nivel mediante un agrandamiento virtual de la ventana de instrucciones. Esto reducirá la penalización del problema del gap entre procesador y memoria.Finalmente, se presentarán refinamientos del mecanismo básico de Vectorización Dinámica Especulativa enfocados a mejorar el rendimiento de éste a un bajo coste

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

Secretaría de Estado de Cultura

Speculative dynamic vectorization

Author: González Colás Antonio María
Pajuelo González Manuel Alejandro
Valero Cortés Mateo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2002
Field of study

Traditional vector architectures have shown to be very effective for regular codes where the compiler can detect data-level parallelism. However, this SIMD parallelism is also present in irregular or pointer-rich codes, for which the compiler is quite limited to discover it. In this paper we propose a microarchitecture extension in order to exploit SIMD parallelism in a speculative way. The idea is to predict when certain operations are likely to be vectorizable, based on some previous history information. In this case, these scalar instructions are executed in a vector mode. These vector instructions operate on several elements (vector operands) that are anticipated to be their input operands and produce a number of outputs that are stored on a vector register in order to be used by further instructions. Verification of the correctness of the applied vectorization eventually changes the status of a given vector element from speculative to non-speculative, or alternatively, generates a recovery action in case of misspeculation. The proposed microarchitecture extension applied to a 4-way issue superscalar processor with one wide bus is 19% faster than the,same processor with 4 scalar buses to Ll data cache. This speed up is due basically to 1) the reduction in number of memory accesses, 15% for SpecInt and 20% for SpecFP, 2) the transformation of scalar arithmetic instructions into their vector counterpart, 28% for SpecInt and 23% for SpecFP, and 3) the exploitation of control independence for mispredicted branches.Peer ReviewedPostprint (published version

UPCommons. Portal del coneixement obert de la UPC

Mapa conceptual global como herramienta para la visión de conjunto de un sistema operativo

Author: López Álvarez David
Pajuelo González Manuel Alejandro
Verdú Mulà Javier
Publication venue
Publication date: 15/06/2009
Field of study

Numerosas asignaturas están formadas por un temario que está totalmente interrelacionado. Al final del curso los estudiantes deberían haber adquirido los conocimientos de cada tema pero, más importante aún, deberían saber cómo interactúan los diferentes temas entre ellos para obtener una visión global de la asignatura. Sin embargo, a menudo los estudiantes se centran en los temas por separado, en parte porque no les ofrecemos herramientas que les ayuden a relacionar las distintas partes del curso. En este trabajo presentamos el uso de un Mapa Conceptual Global (MCG) de una asignatura como recurso docente que ayuda al estudiante a obtener una visión de conjunto de todo el temario. La experiencia ha sido realizada como complemento de una clase de aprendizaje activo en una asignatura de Sistemas Operativos, pero pensamos que puede ser fácilmente aplicable a otros cursos.Peer Reviewe

UPCommons. Portal del coneixement obert de la UPC

Secretaría de Estado de Cultura

Dynamic web worker pool management for highly parallel javascript web applications

Author: Costa Prats Juan José
Pajuelo González Manuel Alejandro
Verdú Mulà Javier
Publication venue: 'Wiley'
Publication date: 01/01/2016
Field of study

JavaScript web applications are improving performance mainly thanks to the inclusion of new standards by HTML5. Among others, web workers API allows multithreaded JavaScript web apps to exploit parallel processors. However, developers have difficulties to determine the minimum number of web workers that provide the highest performance. But even if developers found out this optimal number, it is a static value configured at the beginning of the execution. Because users tend to execute other applications in background, the estimated number of web workers could be non-optimal, because it may overload or underutilize the system. In this paper, we propose a solution for highly parallel web apps to dynamically adapt the number of running web workers to the actual available resources, avoiding the hassle to estimate a static optimal number of threads. The solution consists in the inclusion of a web worker pool and a simple management algorithm in the web app. Even though there are co-running applications, the results show our approach dynamically enables a number of web workers close to the optimal. Our proposal, which is independent of the web browser, overcomes the lack of knowledge of the underlying processor architecture as well as dynamic resources availability changes.Peer ReviewedPostprint (author's final draft

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Runahead threads to improve SMT performance

Author: Pajuelo González Manuel Alejandro
Ramirez Garcia Tanausú
Santana Jaria Oliverio J.
Valero Cortés Mateo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2008
Field of study

In this paper, we propose Runahead Threads (RaT) as a valuable solution for both reducing resource contention and exploiting memory-level parallelism in Simultaneous Multithreaded (SMT) processors. Our technique converts a resource intensive memory-bound thread to a speculative light thread under long-latency blocking memory operations. These speculative threads prefetch data and instructions with minimal resources, reducing critical resource conflicts between threads. We compare an SMT architecture using RaT to both state-of-the-art static fetch policies and dynamic resource control policies. In terms of throughput and fairness, our results show that RaT performs better than any other policy. The proposed mechanism improves average throughput by 37% regarding previous static fetch policies and by 28% compared to previous dynamic resource scheduling mechanisms. RaT also improves fairness by 36% and 30% respectively. In addition, the proposed mechanism permits register file size reduction of up to 60% in a SMT processor without performance degradation.Peer ReviewedPostprint (published version

Crossref

UPCommons. Portal del coneixement obert de la UPC

Introducing runahead threads

Author: Pajuelo González Manuel Alejandro
Ramírez García Tanausu
Santana Jaria Oliverio J.
Valero Cortés Mateo
Publication venue
Publication date: 01/01/2007
Field of study

Simultaneous Multithreading processors share their resources among multiple threads in order to improve performance. However, a resource control policy is needed to avoid resource conflicts and prevent some threads from monopolizing them. On the contrary, resource conflicts would cause other threads to suffer from resource starvation degrading the overall performance. This situation is especially sensitive for memory bounded threads, because they hold an important amount of resources while long latency accesses are being served. Several fetch policies and resource control techniques have been proposed to overcome these problems by limiting the per-thread resource utilization. Nevertheless, this limitation is harmful for memory bounded threads because it restricts the memory level parallelism available that hides the long latency memory accesses. In this paper, we propose Runahead threads on SMT scenarios as a valuable solution for both exploiting the memory-level parallelism and reducing the resource contention. This approach switches a memory-bounded eager resource thread to a speculative light thread, avoiding critical resource blocking among multiple threads. Furthermore, it improves the thread-level parallelism by removing long-latency memory operations from the instruction window, releasing busy resources. We compare an SMT architecture using Runahead threads (SMTRA) to both state-of-the-art static fetch and dynamic resource control policies. Our results show that the SMTRA combination performs better, in terms of throughput and fairness, than any of the other policies.Postprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

On the problem of evaluating the performance of multiprogrammed workloads

Author: Cazorla Francisco
Fernández Rodríguez José Enrique
Pajuelo González Manuel Alejandro
Santana Jaria Oliverio J.
Valero Cortés Mateo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2010
Field of study

Multithreaded architectures are becoming more and more popular. In order to evaluate their behavior, several methodologies and metrics have been proposed. A methodology defines when the measurements for a given workload execution are taken. A metric combines those measurements to obtain a final evaluation result. However, since current evaluation methodologies do not provide representative measurements for these metrics, the analysis and evaluation of novel ideas could be either unfair or misleading. Given the potential impact of multithreaded architectures on current and future processor designs, it is crucial to develop an accurate evaluation methodology for them. This paper presents FAME, a new evaluation methodology aimed to fairly measure the performance of multithreaded processors executing multiprogrammed workloads. FAME reexecutes all programs in the workload until all of them are fairly represented in the final measurements taken. We compare FAME with previously used methodologies showing that it provides more accurate measurements, becoming an ideal evaluation methodology to analyze proposals for multithreaded architectures.Peer ReviewedPostprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Runahead threads: reducing resource contention in SMT processors

Author: Pajuelo González Manuel Alejandro
Ramírez García Tanausu
Santana Jaria Oliverio J.
Valero Cortés Mateo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2007
Field of study

In this work, we propose Runahead threads as a valuable solution for both exploiting memory-level parallelism and reducing resource contention in simultaneous multithreaded processors.Peer ReviewedPostprint (published version

Crossref

UPCommons. Portal del coneixement obert de la UPC

Overhead of the spin-lock loop in UltraSPARC T2

Author: Cakarevic Vladimir
Cazorla Almeida Francisco Javier
Gioiosa Roberto
Nemirovsky Mario
Pajuelo González Manuel Alejandro
Radojković Petar
Valero Cortés Mateo
Verdú Mulà Javier
Publication venue
Publication date: 04/06/2008
Field of study

Spin locks are task synchronization mechanism used to provide mutual exclusion to shared software resources. Spin locks have a good performance in several situations over other synchronization mechanisms, i.e., when on average tasks wait short time to obtain the lock, the probability of getting the lock is high, or when there is no other synchronization mechanism. In this paper we study the effect that the execution of spinlocks create in multithreaded processors. Besides going to multicore architectures, recent industry trends show a big move toward hardware multithreaded processors. Intel P4, IBM POWER5 and POWER6, Sun's UltraSPARC T1 and T2 all this processors implement multithreading in various degrees. By sharing more processor resources we can increase system's performance, but at the same time, it increases the impact that processes executing simultaneously introduce to each other.Postprint (published version

UPCommons. Portal del coneixement obert de la UPC